This is an IMDb data set containing 40,000 movie reviews. Each row has two columns: the text of the review and a sentiment label. Our goal is to categorize reviews by sentiment and to predict the genre of the movie each review is about. We will attempt this with tidyverse tools, sentiment analysis, tf-idf, correlations of bigrams, and LDA topic modeling. We start by loading the csv and assigning it to the data variable.

data <- read.csv("movie.csv")
head(data)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
## 2 When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.<br /><br />When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, but the animation in particular is a sad sight. The storyline is also pretty weak, as its more like three episodes of Schooby-Doo than the single adventurous story we got the last time. But dont misunderstand, it's not very good Schooby-Doo episodes. I didnt laugh a single time, although I might have sniggered once or twice.<br /><br />To the audience who haven't seen the first movie, or don't especially care for a similar sequel, here is a fast review of this movie as a stand-alone product: If you liked schooby-doo, you might like this movie. If you didn't, you could still enjoy this movie if you have nothing else to do. And I suspect it might be a good kids movie, but I wouldn't know. It might have been better if Milo's Return had been a three-episode series on a cartoon channel, or on breakfast TV.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Why do people who do not know what a particular time in the past was like feel the need to try to define that time for others? Replace Woodstock with the Civil War and the Apollo moon-landing with the Titanic sinking and you've got as realistic a flick as this formulaic soap opera populated entirely by low-life trash. Is this what kids who were too young to be allowed to go to Woodstock and who failed grade school composition do? "I'll show those old meanies, I'll put out my own movie and prove that you don't have to know nuttin about your topic to still make money!" Yeah, we already know that. The one thing watching this film did for me was to give me a little insight into underclass thinking. The next time I see a slut in a bar who looks like Diane Lane, I'm running the other way. It's child abuse to let parents that worthless raise kids. It's audience abuse to simply stick Woodstock and the moonlanding into a flick as if that ipso facto means the film portrays 1969.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is too long, the acting is most of the time a Joke and the script is horrible. I did not get the point in mixing the story about Abraham and Noah together. So if you value your time and sanity stay away from this horror.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen to them its brand new. <br /><br />The film. The film is a re run of certain episodes, Man and the hour, Enemy within the gates, Battle School and numerous others with a different edge. Introduction of a new General instead of Captain Square was a brilliant move - especially when he wouldn't cash the cheque (something that is rarely done now).<br /><br />It follows through the early years of getting equipment and uniforms, starting up and training. All in all, its a great film for a boring Sunday afternoon. <br /><br />Two draw backs. One is the Germans bogus dodgy accents (come one, Germans cant pronounced the letter "W" like us) and Two The casting of Liz Frazer instead of the familiar Janet Davis. I like Liz in other films like the carry ons but she doesn't carry it correctly in this and Janet Davis would have been the better choice.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A terrible movie as everyone has said. What made me laugh was the cameo appearance by Scott McNealy, giving an award to one of the murdered programmers in front of a wall of SUN logos. McNealy is the CEO of SUN Microsystem, a company that practically defines itself by its hatred of Microsoft. They have been instrumental in filing antitrust complaints against Microsoft. So, were they silly enough to think this bad movie would add fuel to that fire?<br /><br />There's no public record I see of SUN's involvement, but clearly the makers of this movie know Scott McNealy. An interesting mystery.
##   label
## 1     0
## 2     0
## 3     0
## 4     0
## 5     1
## 6     0

1: Tidy text format

First, we tokenize the text into individual tokens, with each token being a single word.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)

tokens <- data %>% unnest_tokens(output = word, input = text)
head(tokens)
##   label     word
## 1     0        i
## 2     0     grew
## 3     0       up
## 4     0        b
## 5     0     1965
## 6     0 watching

Next, we remove stop words.

library(ggplot2)

data(stop_words) #provided from tidytext
# remove stop words 
tidy_data <- tokens %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
# find most common words used in reviews
tidy_data %>% count(word, sort = TRUE) %>%
  head(20) %>%
  # reorder word by count so the bars plot in descending order
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) + geom_col() + labs(y = NULL)

Skimming the top 20 words used in these reviews, we see “br” is the most common. Upon closer inspection of the data set, “br” comes from the HTML tags “<br>” and “<br />” used to mark line breaks. Since this is not a valid word, we add it as a custom stop word to remove it from the data set.

# add br as stop word
custom_stop_words <- bind_rows(tibble(word = c("br"),  
                                      lexicon = c("custom")), 
                               stop_words)

tidy_data <- tokens %>% anti_join(custom_stop_words)
## Joining with `by = join_by(word)`
tidy_data %>% count(word, sort = TRUE) %>%
  head(20) %>%
  # reorder word by count so the bars plot in descending order
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) + geom_col() + labs(y = NULL)

Knowing this is a data set of movie reviews, it makes sense for the words “movie” and “film” to be the most common words used.

2: Sentiment analysis

Using the NRC lexicon, we label each word with its assigned sentiment(s) and plot the distribution from most to least common. Note that this visualization is limited because a word may carry multiple sentiments. Evaluating words individually, the most common sentiment is positive, with negative as the runner-up.

# NRC maps some words to several sentiments, so the join is many-to-many
sent_data <- tidy_data %>% inner_join(get_sentiments("nrc"),
                                      by = 'word',
                                      relationship = "many-to-many")
sent_data %>% count(sentiment, sort = TRUE) %>%
    mutate(sentiment = reorder(sentiment, n)) %>%
    ggplot(aes(n, sentiment)) + geom_col() + labs(y = NULL)

Let’s take a closer look at which positive and negative words (the top two sentiments) are most common in the movie reviews. We can use a word cloud to better depict this.

library(reshape2)
library(wordcloud)
## Loading required package: RColorBrewer
tidy_data %>%
  # Bing maps some words to both sentiments, so the join is many-to-many
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE) %>%
  # acast() reshapes the counts into a matrix with words as rows and sentiments as columns
  # value.var is the name of the column that fills the cells
  acast(formula = word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red3", "skyblue"), max.words = 100)

3: Analyzing word and document frequency: tf-idf

To analyze more closely what a document is about, we look at word frequencies. We first verify that Zipf’s Law holds, which states that the frequency of a word is inversely proportional to its rank in the frequency table.

len <- nrow(tidy_data)

freq_by_rank <- tidy_data %>% 
  count(word, sort = TRUE) %>%
  mutate(rank = row_number(), 
         `term frequency` = n/len) 

head(freq_by_rank)
##     word     n rank term frequency
## 1  movie 69718    1    0.019800040
## 2   film 61989    2    0.017604990
## 3   time 20107    3    0.005710425
## 4  story 18380    4    0.005219954
## 5    bad 14734    5    0.004184483
## 6 people 14294    6    0.004059522
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`)) + 
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() + scale_y_log10() 
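We can also roughly estimate the Zipf exponent by fitting a line to the log-log relationship. This is a quick sketch using the freq_by_rank data frame built above; a slope near -1 supports Zipf’s Law.

```r
# fit log10(term frequency) against log10(rank); Zipf's Law predicts a slope near -1
zipf_fit <- lm(log10(`term frequency`) ~ log10(rank), data = freq_by_rank)
coef(zipf_fit)
```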

The plot shows the movie reviews follow an approximately linear rank-frequency relationship on the log-log scale, consistent with Zipf’s Law. Next, treating every review as a separate document, we can analyze the tf-idf.

data$doc <- seq_len(nrow(data))

review_words <- data %>% unnest_tokens(output = word, input = text) %>% 
  count(doc, word, sort = TRUE)

total_words <- review_words %>% 
  group_by(doc) %>% 
  summarize(total = sum(n))

review_words <- left_join(review_words, total_words)
## Joining with `by = join_by(doc)`
head(review_words)
##     doc word   n total
## 1  5879  the 198  1855
## 2  6379  the 198  2493
## 3 34168  the 180  2331
## 4 26988   br 160  1004
## 5 20021  the 128  1732
## 6 11097   br 118   756
tf_idf <- review_words %>%
  bind_tf_idf(term = word, document = doc, n=n)

ordered_tf_idf <- tf_idf %>%
  select(-total) %>%
  arrange(desc(tf_idf))

head(ordered_tf_idf)
##     doc          word  n         tf       idf   tf_idf
## 1 18348 trivialboring 26 0.23008850 10.596635 2.438164
## 2 36010       primary  2 0.33333333  5.592688 1.864229
## 3  9153    smallville  3 0.20000000  7.888585 1.577717
## 4 14437    smallville  3 0.20000000  7.888585 1.577717
## 5 15400       stop.oz 23 0.11979167 10.596635 1.269389
## 6  7241        cognac 12 0.09302326  9.498022 0.883537

The most important word appears to be “trivialboring”; looking at the data set, it was copied and pasted 26 times in review 18348. The term “stop.oz” also scores highly, but on closer examination the review containing it repeats the sentence “OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.” 23 times, so it is not a useful review. “Cognac” likewise comes from a review that repeats the same sentence 12 times. Because of the corrections these results would need, tf-idf is probably not the most reliable method for finding important words in this data set.
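One rough mitigation for these repeated-sentence reviews is a sketch like the following, reusing the review_words counts built above: flag documents where a single token accounts for an outsized share of the text and exclude them before computing tf-idf. The 20% threshold is an arbitrary choice for illustration.

```r
# flag reviews where one token makes up more than 20% of all tokens
spammy_docs <- review_words %>%
  group_by(doc) %>%
  summarise(max_share = max(n) / first(total)) %>%
  filter(max_share > 0.2)

# recompute tf-idf with the flagged reviews removed
clean_tf_idf <- review_words %>%
  filter(!doc %in% spammy_docs$doc) %>%
  bind_tf_idf(term = word, document = doc, n = n)
```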

4: Relationships between words: n-grams and correlations

We can also tokenize by n-grams, which are consecutive sequences of n words. Something to note is that tokenizing in this way will result in overlapping tokens.

movie_bigrams <- data %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

head(movie_bigrams)
##   label doc        bigram
## 1     0   1        i grew
## 2     0   1       grew up
## 3     0   1          up b
## 4     0   1        b 1965
## 5     0   1 1965 watching
## 6     0   1  watching and

To find the most common meaningful bigrams, we separate each bigram into its two words, remove any bigram in which either word is a stop word, and count what remains.

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
bigrams_separated <- movie_bigrams %>%
  # separate the bigram column in word1 and word2 columns separating by a space
  separate(col = bigram, into = c("word1", "word2"), sep = " ")

# remove stop words
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% custom_stop_words$word) %>%
  filter(!word2 %in% custom_stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

head(bigram_counts)
##     word1     word2    n
## 1 special   effects 1785
## 2     low    budget 1461
## 3     sci        fi 1146
## 4    real      life  998
## 5    main character  863
## 6  horror     movie  766
bigrams_united <- bigrams_filtered %>%
  unite(col = bigram, c(word1, word2), sep = " ")

head(bigrams_united)
##   label doc              bigram
## 1     0   1       1965 watching
## 2     0   1      school watched
## 3     0   1 played thunderbirds
## 4     0   1       alan counting
## 5     0   1            art form
## 6     0   1        movie hoping

Looking into negated words:

AFINN <- get_sentiments("afinn")

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word2, value, sort = TRUE)

not_words %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(n * value, word2, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Sentiment value * number of occurrences",
       y = "Words preceded by \"not\"")

“Not funny” appears to be the most common negated positive phrase in these reviews, and “not bad” the most common negated negative one.
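If we wanted to fold this into review-level sentiment scores, a rough correction (a sketch; the list of negation words is an arbitrary choice) is to flip the AFINN value of any word preceded by a negator:

```r
# flip the sentiment value of words preceded by a negation term
negators <- c("not", "no", "never", "without")

corrected_values <- bigrams_separated %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(value = ifelse(word1 %in% negators, -value, value))
```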

To visualize more pairings of words simultaneously, we can use a network of nodes.

library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

bigram_graph <- bigram_counts %>%
  head(80) %>%
  # Create igraph graph object from data frame
  graph_from_data_frame()

set.seed(7)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  # edge_alpha to make links transparent based on how common or rare the bigram is
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 # end_cap option tells arrow to end before touching next node
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  # add theme for plotting networks
  theme_void()

If we were interested in words that occur within the same documents but are not adjacent to each other, we can take the count and correlations of pairs of words.

library(widyr)

words <- data %>% unnest_tokens(output = word, input = text)

word_pairs <- words %>%
  filter(!word %in% custom_stop_words$word) %>%
  pairwise_count(item = word, feature = doc, sort = TRUE)

head(word_pairs)
## # A tibble: 6 × 3
##   item1 item2     n
##   <chr> <chr> <dbl>
## 1 film  movie 12306
## 2 movie film  12306
## 3 time  movie  8838
## 4 movie time   8838
## 5 time  film   8222
## 6 film  time   8222
word_cors <- words %>%
  filter(!word %in% custom_stop_words$word) %>%
  group_by(word) %>%
  filter(n() >= 1000) %>%
  pairwise_cor(word, doc, sort = TRUE)

head(word_cors)
## # A tibble: 6 × 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 fi      sci           0.983
## 2 sci     fi            0.983
## 3 effects special       0.534
## 4 special effects       0.534
## 5 budget  low           0.525
## 6 low     budget        0.525
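pairwise_cor() computes the phi coefficient, which is the Pearson correlation between the binary presence/absence vectors of two words across documents. A toy check with made-up documents (the data here are invented for illustration):

```r
library(dplyr)
library(widyr)

toy <- tibble(
  doc  = c(1, 1, 2, 2, 3),
  word = c("sci", "fi", "sci", "fi", "plot")
)

# "sci" and "fi" always appear together, so their correlation is exactly 1;
# "plot" never co-occurs with them, so its correlations with each are -1
toy %>% pairwise_cor(word, doc, sort = TRUE)
```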

We can graph these highly correlated words with the correlation as weights.

set.seed(2016)

word_cors %>%
  filter(correlation > .15) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

5: Topic modeling

Now we will use topic modeling to group the reviews by the genre of the movie they discuss. We first need to convert the tidy format into a document-term matrix, the object the topicmodels package takes as input.

library(topicmodels)

words_dtm <- words %>% 
  filter(!word %in% custom_stop_words$word) %>%
  count(doc, word, sort = TRUE) %>%
  cast_dtm(doc, word, n)

head(words_dtm)
## <<DocumentTermMatrix (documents: 6, terms: 107475)>>
## Non-/sparse entries: 1942/642908
## Sparsity           : 100%
## Maximal term length: 66
## Weighting          : term frequency (tf)

Next, we use the Latent Dirichlet allocation (LDA) algorithm to estimate the mixture of words associated with each of k topics. We start with an arbitrarily low value of k = 4.

lda <- LDA(words_dtm, k = 4, control = list(seed = 1234))

topics <- tidy(lda, matrix = "beta")

head(topics)
## # A tibble: 6 × 3
##   topic term          beta
##   <int> <chr>        <dbl>
## 1     1 marty   0.0000447 
## 2     2 marty   0.0000197 
## 3     3 marty   0.00000967
## 4     4 marty   0.0000791 
## 5     1 titanic 0.0000356 
## 6     2 titanic 0.0000376
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

It appears that every topic is dominated by “movie” and “film,” which isn’t helpful for distinguishing topics. What if we excluded those two words from the clustering?

words_dtm <- words %>% 
  filter(!word %in% custom_stop_words$word) %>%
  filter(!word %in% c("movie", "movies", "film", "films")) %>%
  count(doc, word, sort = TRUE) %>%
  cast_dtm(doc, word, n)

lda <- LDA(words_dtm, k = 6, control = list(seed = 1234))

topics <- tidy(lda, matrix = "beta")

top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

We still see redundant terms that share the same root, such as “acting” and “actor.” To surface more informative terms from these reviews, we map each word to its stem with the SnowballC package and run LDA on the stemmed words.

library(SnowballC)

stem_words <- words %>%
  filter(!word %in% custom_stop_words$word) %>%
  mutate(stem = wordStem(word))

stem_counts <- stem_words %>% 
  count(stem, sort = TRUE)

head(stem_counts)
##      stem     n
## 1    movi 81959
## 2    film 75018
## 3    time 25574
## 4   watch 22269
## 5 charact 22263
## 6   stori 20112

After running LDA with different k values from 2 to 10, the results for k = 6 topics looked the most distinct from each other.

words_dtm <- stem_words %>%
  filter(!word %in% c("movie", "movies", "film", "films")) %>%
  count(doc, stem, sort = TRUE) %>%
  cast_dtm(doc, stem, n)

lda <- LDA(words_dtm, k = 6, control = list(seed = 1234))

topics <- tidy(lda, matrix = "beta")

top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
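The choice of k = 6 above was judged by eye. A rougher quantitative check (a sketch; fitting nine models on 40,000 reviews is slow) is to compare model perplexity across candidate values of k. Lower perplexity suggests a better statistical fit, though topic interpretability matters more for our purposes.

```r
# compare perplexity of LDA models for k = 2 through 10
perplexities <- sapply(2:10, function(k) {
  model <- LDA(words_dtm, k = k, control = list(seed = 1234))
  perplexity(model)
})

plot(2:10, perplexities, type = "b", xlab = "k", ylab = "perplexity")
```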

We can also examine the per-document-per-topic probabilities (gamma), the estimated proportion of words from that document that are generated from that topic. We can then assign a topic to each review.

reviews_gamma <- tidy(lda, matrix = "gamma")
head(reviews_gamma)
## # A tibble: 6 × 3
##   document topic gamma
##   <chr>    <int> <dbl>
## 1 12896        1 0.173
## 2 15253        1 0.159
## 3 5879         1 0.152
## 4 27425        1 0.175
## 5 6894         1 0.157
## 6 3146         1 0.244
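To make the per-review assignment concrete, one option (a sketch using dplyr) is to keep, for each document, the topic with the highest gamma:

```r
# assign each review the topic with the largest gamma
review_topics <- reviews_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()
```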

We can also take the original document-word pairs and find which words in each document were assigned to which topic. The augment() function, a broom generic for tidying model output, returns a tidy data frame of review-word counts together with the topic each word was assigned to.

assignments <- augment(lda, data = words_dtm)
head(assignments)
## # A tibble: 6 × 4
##   document term  count .topic
##   <chr>    <chr> <dbl>  <dbl>
## 1 12896    marti    36      3
## 2 1147     marti     1      1
## 3 36627    marti     1      3
## 4 2962     marti     1      3
## 5 13264    marti     1      3
## 6 26194    marti     1      3
# count how many word occurrences were assigned to each topic
topic_counts <- assignments %>% 
  group_by(.topic) %>%  
  summarise(sum_count = sum(count),
            .groups = 'drop')

topic_counts
## # A tibble: 6 × 2
##   .topic sum_count
##    <dbl>     <dbl>
## 1      1    567320
## 2      2    551541
## 3      3    538637
## 4      4    629460
## 5      5    530300
## 6      6    548846
topic_counts %>%
  ggplot(aes(x = .topic, y = sum_count)) +
  geom_col()

There appears to be a roughly equal number of words assigned to each topic. To get a better idea of what genres these topics correspond to, we can look further into the words of each topic and the sentiments they express. One caveat of using the Bing lexicon to analyze word sentiments is that many words are not categorized, so there is a significant drop in data entries, but the most commonly used words should still appear. Using word clouds to visualize the most prominent words, the hypothesized genres of each topic are:

# assign sentiment to each term in assignments, note row count drops from 2,784,478 to 363,404 as there are many words in the reviews not included in the lexicon.
assign_topics <- assignments %>% inner_join(get_sentiments("bing"), by=c('term'='word'))

gen_cloud <- function(topic) {
 topic_sub <- subset( assign_topics , .topic == topic )
 topic_sub %>% group_by(term) %>% 
   summarise(count = sum(count)) %>%  
   inner_join(get_sentiments("bing"), by=c('term'='word')) %>%
   acast(formula = term ~ sentiment, value.var = "count", fill = 0) %>%
   comparison.cloud(colors = c("red3", "skyblue"), max.words = 100)
}

Topic 1: dystopian/science fiction

gen_cloud(1)

Topic 2: drama

gen_cloud(2)

Topic 3: romantic-comedy

gen_cloud(3)

Topic 4: horror

gen_cloud(4)

Topic 5: war/historical fiction

gen_cloud(5)

Topic 6: comedy/action

gen_cloud(6)

Another limitation of this analysis is that we have no way to check whether our assigned genres match the actual genre of the movie each review is about. The original data set includes a binary “label” column, but we concluded it is not informative for this purpose, since there are realistically more than two movie genres. Because this is an unsupervised learning analysis, we can only make inferences from our results and cannot measure accuracy. Overall, these results show this methodology can, to an extent, be used to predict genre.